GPU Computing in Julia

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 6, 2023

This session introduces GPU computing in Julia.

1 GPGPU

GPUs are ubiquitous in modern computers. The table below lists GPUs found in typical computer systems.

NVIDIA GPUs          Tesla K80             GTX 1080            GT 650M
Computers            servers, cluster      desktop             laptop
Main usage           scientific computing  daily work, gaming  daily work
Memory               24 GB                 8 GB                1 GB
Memory bandwidth     480 GB/sec            320 GB/sec          80 GB/sec
Number of cores      4992                  2560                384
Processor clock      562 MHz               1.6 GHz             0.9 GHz
Peak DP performance  2.91 TFLOPS           257 GFLOPS
Peak SP performance  8.73 TFLOPS           8228 GFLOPS         691 GFLOPS

GPU architecture vs CPU architecture.
* GPUs contain 100s of processing cores on a single card; several cards can fit in a desktop PC
* Each core carries out the same operations in parallel on different input data – single program, multiple data (SPMD) paradigm
* Extremely high arithmetic intensity if one can transfer the data onto and results off of the processors quickly
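The SPMD paradigm maps naturally onto Julia's broadcasting: one scalar function is applied to every element of the data, and on a GPU each element can be handled by a separate thread. A minimal CPU sketch (the function name `saxpy` is mine for illustration; the same broadcast expression runs unchanged on a GPU array):

```julia
# One scalar kernel, applied to many data elements in parallel (SPMD).
saxpy(a, x, y) = a * x + y

x = Float32[1, 2, 3, 4]
y = Float32[10, 20, 30, 40]

# Broadcasting applies saxpy elementwise; on a GPU array the same
# expression compiles to a single GPU kernel over all elements.
z = saxpy.(2.0f0, x, y)
```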

[Figures: Intel i7 CPU die vs NVIDIA Fermi GPU die; Einstein vs Rain Man analogy for CPU vs GPU cores.]

2 GPGPU in Julia

GPU support by Julia is under active development. Check JuliaGPU for currently available packages.

There are multiple paradigms for programming GPUs in Julia, depending on the specific hardware.

  • CUDA is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, …

    The CUDA.jl package allows defining arrays on Nvidia GPUs and overloads many common operations.

  • The AMDGPU.jl package allows defining arrays on AMD GPUs and overloads many common operations.

  • The Metal.jl package allows defining arrays on Apple Silicon and overloads many common operations.

  • The oneAPI.jl package allows defining arrays on Intel GPUs and overloads many common operations.
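A nice consequence of this design is that generic Julia code often runs on any of these backends unchanged: a function written against `AbstractMatrix` works on a plain `Array`, and the same call should work on an `MtlArray`, `CuArray`, `ROCArray`, or `oneArray` (actual operation coverage varies by backend). A sketch on the CPU:

```julia
using LinearAlgebra

# Generic code: no GPU-specific types are mentioned.
gram(X::AbstractMatrix) = X' * X

X = rand(Float32, 4, 3)
G = gram(X)     # runs on the CPU here; pass an MtlArray to run on the GPU
size(G)         # (3, 3)
```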

I’ll illustrate using Metal.jl on my MacBook Pro running macOS Ventura 13.2.1. It has an Apple M2 Max chip with a 38-core GPU.

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/09-juliagpu`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/09-juliagpu/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [dde4c033] Metal v0.3.0
  [37e2e46d] LinearAlgebra

3 Query GPU devices in the system

using Metal

Metal.versioninfo()
macOS 13.3.0, Darwin 21.5.0

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1

1 device:
- Apple M2 Max (64.000 KiB allocated)

4 Transfer data between main memory and GPU

# generate data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = MtlArray(x)
3×3 MtlMatrix{Float32}:
 0.104606  0.734883  0.700877
 0.73905   0.471798  0.93666
 0.84818   0.167962  0.942927
# generate array on GPU directly
# yd = Metal.ones(3, 3)
yd = MtlArray(ones(Float32, 3, 3))
3×3 MtlMatrix{Float32}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
# collect data from GPU to CPU
x = collect(xd)
3×3 Matrix{Float32}:
 0.104606  0.734883  0.700877
 0.73905   0.471798  0.93666
 0.84818   0.167962  0.942927
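Note the arrays above are Float32. Apple GPUs do not support Float64 arithmetic (worth verifying against the Metal.jl documentation for your version), so Float64 data should be converted to single precision before transfer. A CPU-side sketch:

```julia
# Metal GPUs work in single precision; convert Float64 data first.
x64 = rand(3, 3)        # Float64 by default
x32 = Float32.(x64)     # elementwise conversion to single precision
eltype(x32)             # Float32
# then transfer to the device: xd = MtlArray(x32)
```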

5 Linear algebra

using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257)
n = 1024
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z)

# SP matrix multiplication on GPU
bm_gpu = @benchmark Metal.@sync mul!($zd, $xd, $yd)
BenchmarkTools.Trial: 9374 samples with 1 evaluation.
 Range (minmax):  353.625 μs  1.796 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     503.771 μs                GC (median):    0.00%
 Time  (mean ± σ):   531.680 μs ± 109.258 μs   GC (mean ± σ):  0.00% ± 0.00%
         ▁▇  ▅▁▁▃▁ ▁█     ▃▆▄            ▄▄                     
  ▂▂▄▃▄▅███▇██████▇██▅▃▃▇███▇▇▆▃▂▂▁▁▂▃▅████▄▄▂▂▁▁▁▁▁▁▁▁▁▂▂▂▂▁ ▄
  354 μs           Histogram: frequency by time          825 μs <
 Memory estimate: 800 bytes, allocs estimate: 40.
# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)
6.072771008837045e12
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)
BenchmarkTools.Trial: 1655 samples with 1 evaluation.
 Range (minmax):  2.937 ms 10.839 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     3.004 ms                GC (median):    0.00%
 Time  (mean ± σ):   3.021 ms ± 263.584 μs   GC (mean ± σ):  0.00% ± 0.00%
                   ▁▄▆▆█▆▁                                    
  ▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▅████████▆▅▄▅▄▂▃▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▂▁▁▂▁▁▁▁▁▂▂ ▃
  2.94 ms         Histogram: frequency by time        3.12 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)
7.312449639908062e11
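The throughput figures come from the standard flop count for an n×n matrix multiplication: each of the n² output entries takes n multiplies and n−1 adds, roughly 2n³ flops in total. As a small helper (the function name is mine, not from the code above):

```julia
# GFLOPS achieved by an n×n matrix multiply completed in `seconds`.
matmul_gflops(n, seconds) = 2 * n^3 / seconds / 1e9

# e.g. the minimum GPU time above, 353.625 μs for n = 1024:
matmul_gflops(1024, 353.625e-6)   # ≈ 6.07e3 GFLOPS, i.e. ~6.1 TFLOPS
```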

We see an ~8x speedup from the GPU in this single-precision matrix multiplication example.

# cholesky on Gram matrix
# This one doesn't seem to work on Apple M2 chip yet
# xtxd = xd'xd + I
# @benchmark Metal.@sync cholesky($(xtxd))
# xtx = collect(xtxd)
# @benchmark cholesky($(Symmetric(xtx)))

GPU-accelerated Cholesky factorization seems unavailable in Metal.jl at the moment.
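Until `cholesky` works on `MtlArray`, one fallback (assuming the Gram matrix fits in host memory) is to form the Gram matrix on the GPU, `collect` it to the CPU, and factorize there, as the commented code above sketches. The CPU part looks like:

```julia
using LinearAlgebra

X = rand(Float32, 100, 100)
xtx = X'X + I                   # Gram matrix + I: positive definite
F = cholesky(Symmetric(xtx))    # CPU Cholesky factorization
norm(F.L * F.L' - xtx)          # reconstruction error, near zero
```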

6 Elementwise operations on GPU

# elementwise function on GPU arrays
fill!(yd, 1)
bm_gpu = @benchmark Metal.@sync $zd .= log.($yd .+ sin.($xd))
bm_gpu
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):  363.125 μs 20.717 ms   GC (min … max): 0.00% … 80.59%
 Time  (median):     449.979 μs                GC (median):    0.00%
 Time  (mean ± σ):   462.875 μs ± 211.103 μs   GC (mean ± σ):  0.36% ±  0.81%
            ▂▅▆█▆▆▃▄▁▂▂▃▃▃▂▁                                   
  ▁▁▂▂▃▅▆▇▇▇█████████████████▇█▇████▆▆▅▃▃▂▂▂▂▂▃▂▃▃▄▄▄▄▃▄▃▂▂▂▂ ▄
  363 μs           Histogram: frequency by time          602 μs <
 Memory estimate: 8.13 KiB, allocs estimate: 310.
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
bm_cpu = @benchmark $z .= log.($y .+ sin.($x))
bm_cpu
BenchmarkTools.Trial: 538 samples with 1 evaluation.
 Range (minmax):  9.213 ms 9.853 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.256 ms               GC (median):    0.00%
 Time  (mean ± σ):   9.292 ms ± 95.644 μs   GC (mean ± σ):  0.00% ± 0.00%
  ▂▆▇█▇▇▅▅▄▄▃ ▁ ▂▂                                          
  ███████████▇█▇███▇▇▇▄▆█▆▆▄▆▆▇▆▄▆▁▄▄▆▆▇▄▄▄▄▁▇▁▇▆▆▄▄▄▁▆▄▄▆ █
  9.21 ms      Histogram: log(frequency) by time     9.64 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# Speed up
median(bm_cpu.times) / median(bm_gpu.times)
20.570798225252485

GPU brings great speedup (>20x) to the massive evaluation of elementary math functions.
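Part of this efficiency comes from broadcast fusion: Julia compiles the whole expression `zd .= log.(yd .+ sin.(xd))` into a single kernel, making one pass over the data with no temporary arrays. The `@.` macro is a convenient way to write such fused expressions; a CPU sketch:

```julia
x = rand(Float32, 4)
y = ones(Float32, 4)
z = similar(x)

# @. adds the dots for you; the whole right-hand side fuses into
# one loop (or one GPU kernel), writing into z in place.
@. z = log(y + sin(x))

z ≈ log.(y .+ sin.(x))   # same result as the explicitly dotted form
```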